ci: run tests on GPU server by JuArce · Pull Request #747 · yetanotherco/lambda_vm

JuArce · 2026-06-30T15:08:12Z

What

Adds a GPU test suite that runs in the merge queue and blocks the merge if it fails. It rents an RTX 5090 on Vast.ai, runs the CUDA tests there, and always destroys the box — reusing the rent → provision → run → always-destroy orchestration from benchmark-gpu.yml.

Why

CPU CI (pr_main.yaml) runs on GitHub ubuntu-latest runners, which have no GPU, so the CUDA backend is never exercised before merge. This suite runs the GPU-relevant tests on real hardware.

Test groups (run by `scripts/gpu_test.sh` via Makefile targets)

#	Target	Covers
1	`test-math-cuda`	GPU↔CPU kernel parity (NTT, LDE, barycentric, FRI, merkle, ext3, keccak, …)
2	`test-cuda-integration`	End-to-end GPU prove: every dispatch fires + the proof verifies
3	`test-cuda-fallback`	GPU dispatch error → CPU fallback still produces a verifying proof
4	`test-prover-cuda`	The `lambda-vm-prover`/`stark`/`crypto`/`ecsm` suite on the GPU path (CPU CI's sharded prover tests, run single-threaded with `--features cuda`)
5	`test-prover-comprehensive-cuda`	The comprehensive all-instructions prove (`test_prove_elfs_all_instructions_64_full`) on the GPU path

Groups 1–3 are GPU-only (no CPU equivalent); 4–5 mirror CPU CI's prover jobs but with the CUDA path enabled. All groups run even if one fails; the job fails if any group does.

How it works

Trigger: merge_group (one GPU rental per merge, not per push) + workflow_dispatch. A failure ejects the PR from the queue.
The box checks out the merge ref and runs scripts/gpu_test.sh, which:
- pins cudarc (CUDARC_PIN=cuda-12080) and verifies the GPU toolchain (prints full nvidia-smi),
- builds the guest ELFs (compile-programs-asm + compile-programs-rust),
- runs the 5 groups, aggregating failures.
Machine info is also printed in its own dedicated step (nvidia-smi, nvcc --version, CPU/RAM) before the test run, so the hardware is easy to find without scrolling the large test log.
Teardown always destroys the instance (--yes, 3× retry, plus a label-based sweep so a cancelled-mid-rent box can't leak).
Reporting: full stdout+stderr in the step log; the job summary lists the box specs, the failed tests grouped by suite, and the panic/assertion message for each.

Provisioning: price bands + retry across offers

The provisioning is a single Provision instance (retry across offers) step (ideas borrowed from IDP's create-vast-server.yml), designed so a slow or flaky host can't stall the merge gate:

Escalating price bands — PRICE_THRESHOLDS: "0.6 0.8 1.0" ($/hr, ascending). It picks the priciest (most reliable) offer in the cheapest band that has capacity, only climbing to a pricier band when a lower one is empty. The last value ($1) is the hard cap.
Retry across distinct offers — up to MAX_TRIES: 5 different physical hosts. Each host gets a PROVISION_TIMEOUT: 600s readiness budget (running + sshd accepting our key + onstart bootstrap finished); if it isn't ready in time, the box is destroyed and a different offer is tried.
Tried-host exclusion — every attempted host's machine_id is tracked and excluded from later scans, so a flaky host is never re-picked.
Fast-fail on unrecoverable states — an offer whose status_msg reports an image that can't be pulled / a host that can't meet the ask ("not started loading", "cannot be met", daemon errors) is abandoned immediately instead of burning the full 600s budget.
Transient scarcity — within each attempt, all bands are re-scanned up to OFFER_ATTEMPTS: 10 times (30s apart) before giving up, to ride out the small, fast-churning RTX 5090 pool.
Offer filter: RTX 5090, ≥16 cores, ≥48 GB RAM, ≥64 GB disk, cuda_max_good>=13.1, driver major ≥ 580.
Each swapped-out box is destroyed in-step (its recorded id cleared), and the final ready box is recorded for the always-run teardown — so no box leaks across the retry loop.

CUDA version requirement

An NVIDIA driver supporting CUDA ≥ 13.1 is required: the kernels are compiled with the toolkit's nvcc into PTX that the driver JIT-compiles at load — an older driver rejects it with CUDA_ERROR_UNSUPPORTED_PTX_VERSION. Enforced by the cuda_max_good>=13.1 offer filter and documented in the README "GPU Tests" section + crypto/math-cuda/build.rs.

Files

.github/workflows/gpu-tests.yml — new workflow (job gpu-tests).
scripts/gpu_test.sh — new on-box runner (cudarc pin + compile + 5 groups, aggregate exit).
Makefile — adds test-cuda-fallback, test-prover-cuda, test-prover-comprehensive-cuda (groups 1 & 2 targets already existed).
README.md — GPU Tests section + targets table.
crypto/math-cuda/build.rs — comment clarifying the PTX/driver version coupling.

Before this gates merges (manual GitHub settings — admin, not code)

on: merge_group: only makes the workflow eligible to run in the queue.

Add gpu-tests to the required status checks for main (Settings → Branches/Rulesets). Otherwise the job runs in the queue but a failure won't block the merge. Current required checks are only Lint and Test.

Sequencing: the gpu-tests check usually needs to have run once (via workflow_dispatch, or after this merges to main) before it appears in the required-checks picker. merge_group runs the workflow from the merge-commit ref (base + PR), same pattern as the existing merge-queue-only test-prover-comprehensive job.

Repo secrets VAST_API_KEY and VAST_TEMPLATE_HASH are already set.

Status

All 5 groups pass end-to-end on a rented RTX 5090. One real test-harness bug was fixed along the way: test-cuda-fallback ran its two tests in parallel, racing on the process-global GPU dispatch counters (assert_eq! count mismatches) — fixed by adding --test-threads=1 to that Makefile target.

Notes

Tradeoff: as a merge gate, a persistent Vast outage (no ready box after MAX_TRIES offers across all price bands) would block merges until it clears; the price-band climb, per-host readiness budget + retry-across-offers, and re-scan-for-scarcity loops mitigate it.

github-actions · 2026-06-30T19:19:11Z

Codex Code Review

No findings.

I reviewed the PR diff only: new GPU merge-queue workflow, GPU test script, Make targets, README additions, and the math-cuda build comment. I did not identify concrete safety/security, VM semantics, GPU resource, correctness, or significant performance issues in the changed code.

Verification was static only per instructions; I did not build or run tests.

claude · 2026-06-30T19:19:58Z

Review: GPU test suite in merge queue

This is an infra/CI PR (workflow + Makefile targets + scripts + docs). I verified the Makefile targets reference real features and test files (test-cuda-faults, cuda_fallback_tests, cuda_path_integration all exist), and the orchestration is well-built. Notable strengths: secrets kept out of the pip-install step, pinned Vast CLI commit, ephemeral per-instance SSH key, ref validation before remote bash -lc, pipefail so test failures aren't masked by tee, and a robust always-destroy teardown with a label-based leak sweep.

No safety, correctness, or performance issues found. Two minor readability nits (posted inline):

Low — stale comment: scripts/gpu_test.sh:12 says "All three groups run" but five groups now run.
Low — stale comment: scripts/gpu_test.sh:42 references a cuda_max_good>=12.8 offer floor; the new workflow's floor is 13.1. The 12.8 cudarc pin itself is correct — it's the host-side symbol floor and is independent of the PTX/driver-13.1 requirement — but the comment conflates the two.

Both are cosmetic; the workflow is sound. Note the PR's own caveat stands: as a required merge gate, a sustained Vast.ai outage (no offer after the retry loop) would block merges until it clears — an accepted tradeoff per the description.

MauroToscano · 2026-06-30T19:34:12Z

/bench-growth

github-actions · 2026-06-30T19:40:10Z

Benchmark — ethrex 20 transfers (median of 3)

_{Table parallelism: auto (cores / 3)}

Metric	main	PR	Δ
Peak heap	73337 MB	72870 MB	-467 MB (-0.6%) ⚪
Prove time	40.250s	40.084s	-0.166s (-0.4%) ⚪

✅ No significant change.

🔬 Looks like a small speedup (-0.4%) — below what 3 runs can confirm. Comment /bench-abba to run the drift-free ABBA tiebreaker (paired-t CI + exact Wilcoxon). Note: it occupies the bench server for ~30–40 min.
Optional pair count: /bench-abba 32 (20 resolves ~1%, 32 for ~0.6%).

✅ Low variance (time: 2.8%, heap: 1.9%)

Memory Growth

_{ethrex distinct-account transfers · default parallelism · 1 sample per point}

Transfers	main (MB)	PR (MB)	Δ
4	24560	24844	+284 MB (+1.2%)
8	38026	37895	-131 MB (-0.3%)
12	50757	48947	-1810 MB (-3.6%)
16	60593	61176	+583 MB (+1.0%)
20	72309	73056	+747 MB (+1.0%)

Growth rate: 2993 MB / transfer (main: 2952, Δ: +1.4%)
Fit: R² = 0.9995 (main: 0.9969)

✅ No significant change in memory scaling.

_{Commit: 1d08005 · Baseline: cached · Runner: self-hosted bench}

github-actions · 2026-06-30T19:49:36Z

AI Review

PR #747 · 5 changed files

Findings

Status	Sev	Location	Finding	Found by
candidate	low	.github/workflows/gpu-tests.yml:236	Hardcoded CUDARC_PIN/SYSROOT_DIR in the SSH remote string duplicate the script defaults	minimax minimax/MiniMax-M3
candidate	low	scripts/gpu_test.sh:42	Stale comment in scripts/gpu_test.sh still references the old `cuda_max_good>=12.8` offer floor	minimax minimax/MiniMax-M3

Status column reflects the verdict from the verifier: deepseek-verifier (openrouter/deepseek/deepseek-v4-pro).

AI-001: Hardcoded CUDARC_PIN/SYSROOT_DIR in the SSH remote string duplicate the script defaults

Status: candidate
Severity: low
Location: .github/workflows/gpu-tests.yml:236
Found by: minimax:minimax/MiniMax-M3
Verified by: -
Rejected by: -

Claim

The REMOTE string passed over SSH hardcodes CUDARC_PIN=cuda-12080 SYSROOT_DIR=/opt/lambda-vm-sysroot, while the same defaults are already declared inside scripts/gpu_test.sh (lines 22-23). If the script defaults change, the workflow will silently disagree with the script.

Evidence

.github/workflows/gpu-tests.yml line 236: CUDARC_PIN=cuda-12080 SYSROOT_DIR=/opt/lambda-vm-sysroot bash scripts/gpu_test.sh. scripts/gpu_test.sh lines 22-23: CUDARC_PIN="${CUDARC_PIN:-cuda-12080}" and export SYSROOT_DIR="${SYSROOT_DIR:-/opt/lambda-vm-sysroot}". Both default to the same values today, but the duplication means a future bump in either default must be made in two places.

Suggested fix

Drop the inline env from the SSH command so the script's defaults are the single source of truth, e.g. ... bash scripts/gpu_test.sh. Override only when intentionally diverging from the defaults.

AI-002: Stale comment in scripts/gpu_test.sh still references the old `cuda_max_good>=12.8` offer floor

Status: candidate
Severity: low
Location: scripts/gpu_test.sh:42
Found by: minimax:minimax/MiniMax-M3
Verified by: -
Rejected by: -

Claim

The explanatory comment about cudarc pinning says the pin "matches the cuda_max_good>=12.8 offer floor", but the new .github/workflows/gpu-tests.yml query now requires cuda_max_good>=13.1 (raised because the base image's nvcc is 13.1 and the build emits 13.1 PTX). This is stale documentation drift introduced by this PR.

Evidence

The comment on line 42 of scripts/gpu_test.sh reads: "# CUDA version (12.8, matching the cuda_max_good>=12.8 offer floor) avoids that." The new workflow in .github/workflows/gpu-tests.yml line 85 uses cuda_max_good>=13.1 with a comment explaining "the box's driver must support CUDA 13.1 because the template's nvcc is 13.1". The script's claim is no longer accurate and will mislead anyone reading the script after the offer floor moves again.

Suggested fix

Update the comment to either: (a) drop the offer-floor reference and just explain that the pin is a known-symbol set older than the newest cudarc; or (b) reference cuda_max_good>=13.1 consistently with the workflow query.

Reviewer Lanes

Lane	Model	Prompt	Status	Findings
glm	openrouter/z-ai/glm-5.2	general	error: opencode failed (provider/auth/runtime error) and no findings were submitted	0
kimi	openrouter/moonshotai/kimi-k2.7-code	general	error: opencode failed (provider/auth/runtime error) and no findings were submitted	0
minimax	minimax/MiniMax-M3	general	success	2
moonmath	zro/minimax-m3	general	error: agentic lane timed out after 1800s	0
nemotron	openrouter/nvidia/nemotron-3-ultra-550b-a55b	general	error: opencode failed (provider/auth/runtime error) and no findings were submitted	0

Verification Lanes

Lane	Model	Status	Confirmed	Rejected	Uncertain
deepseek-verifier	openrouter/deepseek/deepseek-v4-pro	error: opencode failed (provider/auth/runtime error) and no verifications were submitted	0	0	0

Native Codex and Claude reviews run separately and post their own comments. They are not included in this structured provenance report.

Raw lane outputs, candidates, final issues, and model metrics are uploaded as workflow artifacts.

…sts_gpu

# Conflicts: # Makefile

…, sed guard, doc nits (#777) - gpu-tests.yml: capture wait_ready's exit code in an else branch (rc=$? after 'if ...; fi' is always 0, so the timeout-vs-image-failure diagnostic never worked) - gpu-tests.yml: add ServerAliveInterval/CountMax to the test ssh session so a box that goes dark mid-suite fails in ~10 min instead of eating the 240-min job budget - gpu-tests.yml: upload the full test log as an artifact (the step log gets truncated in the UI for multi-hour runs, as the workflow itself notes) - gpu_test.sh: guard the cudarc-pin sed anchors (a silent no-op after a math-cuda Cargo.toml refactor would resurrect the fallback-latest driver-symbol panic) and restore the mutated Cargo.toml on exit so manual runs don't leave the tree dirty - Makefile: add test-prover-debug to .PHONY; correct the test-cuda-integration comment (the test asserts R1-R4 counters, not R1-R3); scope the comprehensive-cuda parity comment (CPU's merge-queue job also runs test_recursion_execute) - docs/roadmap.md: GPU FFT / Merkle tree / FRI are implemented and CI-tested, not 'Planned' - prove_elfs_tests.rs: fix 'cargo test --ignored' -> 'cargo test -- --ignored' in the ignore comment

ci: run tests on GPU server

d95e3c8

JuArce self-assigned this Jun 30, 2026

JuArce added 8 commits June 30, 2026 12:17

keep instance running for debug

70754b3

ci: require CUDA 13.1

0c3b895

ci: test prover on CUDA

9abe6f0

Merge branch 'main' into ci_run_tests_gpu

db3f7ce

ci: improve gpu failed tests summaries

cd83966

ci: test-threads=1

25904c9

remove temporary code

e862a66

Merge branch 'main' into ci_run_tests_gpu

1d08005

JuArce marked this pull request as ready for review June 30, 2026 19:17

JuArce added the ai-review Trigger the AI review label Jun 30, 2026

claude Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread scripts/gpu_test.sh Outdated

claude Bot reviewed Jun 30, 2026

View reviewed changes

Comment thread scripts/gpu_test.sh Outdated

JuArce marked this pull request as draft June 30, 2026 19:48

JuArce added 4 commits June 30, 2026 17:00

fix: set cuda_max_good>=12.8

57e3241

comments

7d8241c

apply code review

764aade

Merge branch 'main' into ci_run_tests_gpu

8d70a0d

JuArce marked this pull request as ready for review June 30, 2026 21:28

JuArce added 5 commits July 1, 2026 12:06

dummy comment

35a4c0e

remove tmp code

919fe14

reduce to 48gb

d2a4cef

add retries

c118340

rent server with cuda 13.1

ddf542a

JuArce and others added 6 commits July 1, 2026 16:07

print nvidia-smi

7fc3662

print machine info in a separated step

4806684

Merge branch 'main' into ci_run_tests_gpu

05b938e

Merge remote-tracking branch 'origin/ci_run_tests_gpu' into ci_run_te…

da035de

…sts_gpu

remove temp code

224f093

Merge remote-tracking branch 'origin/main' into ci_run_tests_gpu

83af41d

# Conflicts: # Makefile

MauroToscano mentioned this pull request Jul 3, 2026

ci(gpu-tests): review fixes — rc capture, ssh keepalive, log artifact, sed guard, doc nits #777

Merged

MauroToscano added 2 commits July 3, 2026 18:01

Merge branch 'main' into ci_run_tests_gpu

a96fc2c

MauroToscano approved these changes Jul 3, 2026

View reviewed changes

MauroToscano merged commit 509fd3f into main Jul 3, 2026
1 check passed

MauroToscano deleted the ci_run_tests_gpu branch July 3, 2026 21:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

ci: run tests on GPU server#747

ci: run tests on GPU server#747
MauroToscano merged 26 commits into
mainfrom
ci_run_tests_gpu

JuArce commented Jun 30, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 30, 2026

Uh oh!

Uh oh!

Uh oh!

claude Bot commented Jun 30, 2026

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

github-actions Bot commented Jun 30, 2026

Uh oh!

github-actions Bot commented Jun 30, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

JuArce commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

Test groups (run by scripts/gpu_test.sh via Makefile targets)

How it works

Provisioning: price bands + retry across offers

CUDA version requirement

Files

Before this gates merges (manual GitHub settings — admin, not code)

Status

Notes

Uh oh!

github-actions Bot commented Jun 30, 2026

Codex Code Review

Uh oh!

Uh oh!

Uh oh!

claude Bot commented Jun 30, 2026

Review: GPU test suite in merge queue

Uh oh!

MauroToscano commented Jun 30, 2026

Uh oh!

github-actions Bot commented Jun 30, 2026

Benchmark — ethrex 20 transfers (median of 3)

Memory Growth

Uh oh!

github-actions Bot commented Jun 30, 2026

AI Review

Findings

Reviewer Lanes

Verification Lanes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

JuArce commented Jun 30, 2026 •

edited

Loading

Test groups (run by `scripts/gpu_test.sh` via Makefile targets)